
3 research outputs found

Self Supervision Does Not Help Natural Language Supervision at Scale

Authors: Gunter, Tom; Katharopoulos, Angelos; Shankar, Vaishaal; Weers, Floris; Yang, Yinfei
Publication date: 20/01/2023

Self supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but their results notably use small pre-training datasets (<50M samples) and do not reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.

arXiv.org e-Print Archive

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Authors: Daxberger, Erik; Du, Xianzhi; Eichner, Marcin; Emmersberger, Michael; Gunter, Tom; Pang, Ruoming; Toshev, Alexander; Weers, Floris; Yang, Yinfei; Zhang, Bowen
Publication date: 08/09/2023

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by activating only a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images, rather than individual patches, are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.

arXiv.org e-Print Archive

Bias in Automated Image Colorization: Metrics and Error Types

Authors: Bucur, Doina; Stapel, Frank; Weers, Floris
Publication date: 16/02/2022

We measure the color shifts present in images from the ADE20K dataset when they are colorized by the automatic GAN-based DeOldify model. We introduce fine-grained local and regional bias measurements between the original and the colorized images, and observe many colorization effects. We confirm a general desaturation effect, and also provide novel observations: a shift towards the training average, a pervasive blue shift, different color shifts among image categories, and a manual categorization of colorization errors into three classes.

arXiv.org e-Print Archive

University of Twente Research Information